JP920000447US1 



CLAIMS: 



1. A method for clustering data points with defined quantified relationships between
them comprising the steps of:

obtaining lead value for each data point either by deriving from said
quantified relationships or as given input,

ranking each data point in a lead value sequence list in descending order of
lead value,

assigning the first data point in said lead value sequence list as the leader of
the first cluster, and

considering each subsequent data point in said lead value sequence list as a
leader of a new cluster if its relationship with the leaders of each of the
previous clusters is less than a defined threshold value or as a member of one
or more clusters where its relationship with the cluster leader is more than or
equal to said threshold value.
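The steps of claim 1 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `leader_cluster`, the callable `rel(p, q)` (relationship of point `p` with leader `q`, a direction the claim does not fix for asymmetric relationships) and the example matrix `R` are all assumptions made for the sketch.

```python
from typing import Callable, Dict, List, Sequence


def leader_cluster(
    lead: Sequence[float],
    rel: Callable[[int, int], float],
    threshold: float,
) -> Dict[int, List[int]]:
    """Return {leader index: [member indices]} for points 0..len(lead)-1."""
    # Rank the points in descending order of lead value.
    order = sorted(range(len(lead)), key=lambda i: -lead[i])
    clusters: Dict[int, List[int]] = {}
    for p in order:
        # Existing leaders whose relationship with p meets the threshold.
        close = [q for q in clusters if rel(p, q) >= threshold]
        if close:
            for q in close:  # p may belong to more than one cluster
                clusters[q].append(p)
        else:
            clusters[p] = [p]  # p becomes the leader of a new cluster
    return clusters


# Hypothetical symmetric relationship matrix for four data points.
R = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
lead = [3.0, 1.0, 2.0, 0.5]  # lead values given as input
clusters = leader_cluster(lead, lambda p, q: R[p][q], 0.5)
print(clusters)
```

The first point in the ranked list always becomes a leader because no leaders exist yet, which matches the "assigning the first data point" step.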

2. The method as claimed in claim 1, wherein said relationships between data points are 
symmetric or asymmetric.

3. The method as claimed in claim 1, wherein the lead value of each data point is
determined by taking the sum of relation values of each of the other data points to
said data point.
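As a small illustration of the lead value described in claim 3, the sketch below sums, for each data point, the relation values of all other points to it. The matrix layout is an assumption: `rel[j][i]` is taken to hold the relation value of point `j` to point `i`.

```python
from typing import List, Sequence


def lead_values(rel: Sequence[Sequence[float]]) -> List[float]:
    """Lead value of point i = sum over j != i of rel[j][i]."""
    n = len(rel)
    return [sum(rel[j][i] for j in range(n) if j != i) for i in range(n)]


# Hypothetical asymmetric relation matrix for three data points.
rel = [
    [1.0, 0.5, 0.0],
    [0.25, 1.0, 0.5],
    [0.0, 0.25, 1.0],
]
leads = lead_values(rel)
print(leads)
```

With an asymmetric matrix the column sum used here differs from the row sum, so the indexing convention should be checked against the data at hand.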

4. The method as claimed in claim 1, wherein said threshold value is adaptively found
for a given number of clusters.
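The claims do not say how the threshold is adapted. One possible approach, shown here purely as an assumption, is to try each distinct relation value as a candidate threshold and keep the one whose leader count is closest to the requested number of clusters. The names `count_leaders` and `find_threshold` are hypothetical.

```python
from typing import List, Sequence


def count_leaders(
    lead: Sequence[float], rel: Sequence[Sequence[float]], threshold: float
) -> int:
    """Number of cluster leaders produced for a given threshold."""
    order = sorted(range(len(lead)), key=lambda i: -lead[i])
    leaders: List[int] = []
    for p in order:
        # p leads a new cluster only if it is below the threshold
        # with respect to every existing leader.
        if all(rel[p][q] < threshold for q in leaders):
            leaders.append(p)
    return len(leaders)


def find_threshold(
    lead: Sequence[float], rel: Sequence[Sequence[float]], k: int
) -> float:
    """Candidate threshold whose cluster count is closest to k."""
    n = len(lead)
    candidates = sorted({rel[i][j] for i in range(n) for j in range(n) if i != j})
    # min() keeps the smallest candidate when several are equally close.
    return min(candidates, key=lambda t: abs(count_leaders(lead, rel, t) - k))


# Hypothetical data: four points with symmetric relationships.
R = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
lead = [3.0, 1.0, 2.0, 0.5]
t_two = find_threshold(lead, R, 2)
t_three = find_threshold(lead, R, 3)
print(t_two, t_three)
```

An exhaustive scan is used rather than bisection because the number of leaders is not guaranteed to change monotonically with the threshold.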

5. A method for organizing a set of data points into a hierarchy of clusters wherein the
method claimed in claim 1 is first used to cluster the data points into sets of small
sizes, each smaller set is further subclustered using the method and subclustering is
repeated until a terminating condition is reached.
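The recursive subclustering of claim 5 might look like the sketch below. Several details are assumptions, since the claim leaves them open: the terminating condition (a cluster of at most `max_size` members, or a depth limit), and raising the threshold by a fixed `step` at each level. All names are hypothetical.

```python
from typing import Dict, List, Sequence


def cluster_once(points, lead, rel, threshold) -> Dict[int, List[int]]:
    """One pass of leader clustering restricted to the given points."""
    clusters: Dict[int, List[int]] = {}
    for p in sorted(points, key=lambda i: -lead[i]):
        close = [q for q in clusters if rel[p][q] >= threshold]
        if close:
            for q in close:
                clusters[q].append(p)
        else:
            clusters[p] = [p]
    return clusters


def build_hierarchy(points, lead, rel, threshold,
                    step=0.25, max_size=2, max_depth=5) -> dict:
    """Return {leader: subtree}; a leaf subtree is a list of members."""
    tree = {}
    for leader, members in cluster_once(points, lead, rel, threshold).items():
        if len(members) <= max_size or max_depth <= 1:
            tree[leader] = members  # assumed terminating condition
        else:
            # Subcluster with a stricter (assumed) threshold.
            tree[leader] = build_hierarchy(
                members, lead, rel, threshold + step,
                step, max_size, max_depth - 1)
    return tree


# Hypothetical data: five points, already ranked by lead value.
R = [
    [1.0, 0.6, 0.4, 0.1, 0.1],
    [0.6, 1.0, 0.3, 0.1, 0.1],
    [0.4, 0.3, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.1, 0.7, 1.0],
]
lead = [5.0, 4.0, 3.0, 2.0, 1.0]
tree = build_hierarchy(list(range(5)), lead, R, 0.25)
print(tree)
```

The depth limit guards against endless recursion when a subcluster cannot be split further.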

6. The method as claimed in claim 1 applied to text summarization of a single document
or a collection of documents comprising the steps of:

segmenting the given input text into blocks such as sentences, a collection of
sentences, paragraphs,

excluding words belonging to a defined list of 'stop' words,

replacing words by their unique synonymous word, if it exists, from a given
collection of synonyms,

applying stemming algorithms for mapping words to root words,

representing the resulting blocks of text, with respect to a dictionary which is
either given or computed from the input text, by a binary vector of size equal
to the number of words in the dictionary whose ith element is 1 if ith word in
the dictionary is present in the block,

computing the relationship between any data points di and dj by evaluating
R(di,dj) = |dj.Tdi|/|dj| wherein T is a thesaurus matrix whose ijth element
reflects the extent of inclusion of meaning of jth word in the meaning of ith
word, and

clustering the data points wherein the lead value of each data point is
determined by taking the sum of relation values of each of the other data
points to said data point, the threshold value is adaptively found for a given
number of clusters and the set of leaders of the resulting clusters summarize
the given text.
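The relationship R(di,dj) = |dj.Tdi|/|dj| can be illustrated with plain Python lists. The reading of the notation is an assumption: the numerator is taken as the absolute value of the inner product of dj with T applied to di, and |dj| as the number of dictionary words present in block dj. The function name and the example vectors are hypothetical.

```python
from typing import Sequence


def relationship(
    di: Sequence[float], dj: Sequence[float], T: Sequence[Sequence[float]]
) -> float:
    """R(di, dj) = |dj . (T di)| / |dj| for binary block vectors."""
    n = len(di)
    # (T di)[r] = sum over c of T[r][c] * di[c]
    t_di = [sum(T[r][c] * di[c] for c in range(n)) for r in range(n)]
    overlap = abs(sum(dj[r] * t_di[r] for r in range(n)))
    size = sum(dj)  # assumed meaning of |dj|: words present in dj
    return overlap / size if size else 0.0


# Two blocks over a hypothetical four-word dictionary.
d_i = [1, 1, 0, 0]
d_j = [1, 0, 1, 1]

# Identity thesaurus: words only match themselves.
identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]

# Thesaurus in which the meaning of word 1 is fully included in word 2.
thesaurus = [row[:] for row in identity]
thesaurus[2][1] = 1

r_identity = relationship(d_i, d_j, identity)
r_thesaurus = relationship(d_i, d_j, thesaurus)
print(r_identity, r_thesaurus)
```

Under this reading, an identity thesaurus reduces R(di,dj) to the fraction of words of dj that also occur in di, and a richer thesaurus can only raise that value when its entries are non-negative.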

7. The method as claimed in claim 6 wherein said dictionary is computed by taking the
fraction of words, excluding the stop words, with highest tfidf value, which is given
by:

tfidf(wi) = tfi * log(N/dfi)

where tfidf(wi) is the lead value of data point wi, tfi = the number of times the data
point wi occurred in the whole text, dfi = the number of documents containing the
data point wi and N = the total number of documents in the text.
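A dictionary built from the tfidf formula above could be computed as in the following sketch. The function name, the natural logarithm, the rounding of the kept fraction and the alphabetical tie-break are assumptions not stated in the claim.

```python
import math
from collections import Counter
from typing import Iterable, List


def build_dictionary(
    docs: List[List[str]], stop_words: Iterable[str], fraction: float
) -> List[str]:
    """Keep the given fraction of non-stop words with highest tfidf."""
    stop = set(stop_words)
    n_docs = len(docs)
    # tf: occurrences in the whole text; df: documents containing the word.
    tf = Counter(w for doc in docs for w in doc if w not in stop)
    df = Counter(w for doc in docs for w in set(doc) if w not in stop)
    score = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    keep = max(1, round(fraction * len(score)))
    # Highest score first; ties broken alphabetically (an assumption).
    ranked = sorted(score, key=lambda w: (-score[w], w))
    return ranked[:keep]


# Hypothetical tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran", "ran"],
    ["the", "dog", "ran"],
]
dictionary = build_dictionary(docs, {"the"}, 0.5)
print(dictionary)
```

A word that appears in every document gets a score of zero under this formula, so it is unlikely to enter the dictionary unless the kept fraction is large.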



8. The method as claimed in claim 6 wherein said thesaurus matrix is either a given,
identity matrix or computed from a collection of documents.

9. The method as claimed in claim 6 wherein each block is represented by a vector
whose ith element represents the frequency of occurrence of ith word in the block.

10. A method for organizing a set of text documents into a hierarchy of clusters wherein
the method claimed in claim 6 is first used to cluster the given documents into sets of
small sizes, each smaller set is further subclustered using the method and
subclustering is repeated until a terminating condition is reached.

11. The method as claimed in claim 10 applied to organize the results returned by any
information retrieval system in response to a user query into a hierarchy of clusters.

12. The method as claimed in claim 11, wherein the hierarchy is used to aid the user in
modifying his/her query and/or in browsing through the results.

13. The method as claimed in claim 11, wherein the information retrieval system is any
search engine retrieving Web documents.

14. The method as claimed in claim 5, applied to vocabulary organization for a group of
documents wherein the data points are the words in the dictionary of the vocabulary,
the lead value of a word is either its frequency of occurrence in the collection, the
number of documents containing the word or its tfidf value, the relationship R(di,dj)
denotes the fraction of documents containing the jth word that also contain ith word,
and the clustering produced by the application of the method results in a structured
hierarchical organization of the vocabulary.






15. The method as claimed in claim 14, wherein the structured vocabulary is used to
provide text summarization for the associated documents.



16. The method as claimed in claim 14 applied to customer profiling wherein the
dictionary is built and the vocabulary is organised using the documents that are
viewed by the customer.

17. The method as claimed in claim 5 wherein data points correspond to the products
cataloged in the store, the lead value of a product is its per unit profit, its per unit
value or the number of items sold per unit time, and the relationship between the
products is either explicitly defined or derived from the purchase data.

18. The method as claimed in claim 17 wherein the product di is related to the product dj
by the fraction of customer transactions containing dj that also contain di.
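The transaction-based relationship of claim 18 can be sketched as below, assuming each transaction is available as a set of product identifiers. The function name and the sample baskets are hypothetical.

```python
from typing import Hashable, List, Set


def co_occurrence(
    di: Hashable, dj: Hashable, transactions: List[Set[Hashable]]
) -> float:
    """Fraction of transactions containing dj that also contain di."""
    with_dj = [t for t in transactions if dj in t]
    if not with_dj:
        return 0.0  # dj never purchased: treat as unrelated (assumption)
    return sum(1 for t in with_dj if di in t) / len(with_dj)


# Hypothetical purchase data.
transactions = [
    {"bread", "butter"},
    {"bread"},
    {"bread", "butter", "jam"},
    {"jam"},
]
butter_given_bread = co_occurrence("butter", "bread", transactions)
bread_given_butter = co_occurrence("bread", "butter", transactions)
print(butter_given_bread, bread_given_butter)
```

The two results differ, which shows why this relationship is asymmetric in general: every basket with butter also has bread, but not the reverse.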

19. The method as claimed in 17 applied to analyze sales of a store for the merchant or to
organize the layout of the store to facilitate easy access to products.

20. The method as claimed in 17 applied to personalize the electronic store layout to an
individual customer by using the relationship that is specific to the customer.

21. The method as claimed in claim 5, applied to customer segmentation for a sales or
service organization wherein the data points are the customers in the data base, the
lead values are their total purchase amount per unit time, their income, the number of
times customers visited the store, or the number of items bought by the customer, the
relationship between customers is either explicitly defined or derived from some
relevant data, with the resulting clustering reflecting a structured grouping of
customers with similar performances.

22. The method as claimed in claim 21, wherein the customer di is related to the
customer dj by the fraction of products bought by dj that are also bought by di.

23. A system for clustering data points with defined quantified relationships between
them comprising:

means for obtaining lead value for each data point either by deriving from said
quantified relationships or as given input,

means for ranking each data point in a lead value sequence list in descending
order of lead value,

means for assigning the first data point in said lead value sequence list as the
leader of the first cluster, and

means for considering each subsequent data point in said lead value sequence
list as a leader of a new cluster if its relationship with the leaders of each of
the previous clusters is less than a defined threshold value or as a member of
one or more clusters where its relationship with the cluster leader is more than
or equal to said threshold value.

24. The system as claimed in claim 23, wherein said relationships between data points are
symmetric or asymmetric.



25. The system as claimed in claim 23, wherein the means for obtaining lead value of
each data point is by taking the sum of relation values of each of the other data points
to said data point.



26. The system as claimed in claim 23, wherein said threshold value is adaptively found
for a given number of clusters.



27. The system for organizing a set of data points into a hierarchy of clusters wherein the
system claimed in claim 23 is first used to cluster the data points into sets of small
sizes, each smaller set is further subclustered using the system and subclustering is
repeated until a terminating condition is reached.

28. The system as claimed in claim 23 used for text summarization of a single document
or a collection of documents comprising:

means for segmenting the given input text into blocks such as sentences, a
collection of sentences, paragraphs,

means for excluding words belonging to a defined list of 'stop' words,

means for replacing words by their unique synonymous word, if it exists, from
a given collection of synonyms,

means for applying stemming algorithms for mapping words to root words,

means for representing the resulting blocks of text, with respect to a dictionary
which is either given or computed from the input text, by a binary vector of
size equal to the number of words in the dictionary whose ith element is 1 if
ith word in the dictionary is present in the block,

means for computing the relationship between any data points di and dj by
evaluating R(di,dj) = |dj.Tdi|/|dj| wherein T is a thesaurus matrix whose ijth
element reflects the extent of inclusion of meaning of jth word in the meaning
of ith word, and

means for clustering the data points wherein the lead value of each data point
is determined by taking the sum of relation values of each of the other data
points to said data point, the threshold value is adaptively found for a given
number of clusters and the set of leaders of the resulting clusters summarize
the given text.

29. The system as claimed in claim 28 wherein said dictionary is computed by taking the
fraction of words, excluding the stop words, with highest tfidf value, which is given
by means of:

tfidf(wi) = tfi * log(N/dfi)

where tfidf(wi) is the lead value of data point wi, tfi = the number of times the data
point wi occurred in the whole text, dfi = the number of documents containing the
data point wi and N = the total number of documents in the text.

30. The system as claimed in claim 28 wherein said thesaurus matrix is either a given
identity matrix or computed from a collection of documents.

31. The system as claimed in claim 28 wherein each block is represented by a vector
means whose ith element represents the frequency of occurrence of ith word in the
block.

32. A system for organizing a set of text documents into a hierarchy of clusters wherein
the system claimed in claim 28 is first used to cluster the given documents into sets of
small sizes, each smaller set is further subclustered using the system and the
subclustering is repeated until a terminating condition is reached.

33. The system as claimed in claim 32 used to organize the results returned by any
information retrieval system in response to a user query into a hierarchy of clusters.

34. The system as claimed in claim 33, wherein the hierarchy of clusters is used to aid the
user in modifying his/her query and/or in browsing through the results.

35. The system as claimed in claim 33, wherein the information retrieval system is any
search engine retrieving Web documents.

36. The system as claimed in claim 27, used for vocabulary organization for a group of
documents wherein the data points are the words in the dictionary of the vocabulary,
the lead value of a word is either its frequency of occurrence in the collection, the
number of documents containing the word or its tfidf value, the relationship R(di,dj)
denotes the fraction of documents containing the jth word that also contain ith word,
and the clustering produced by the system results in a structured hierarchical
organization of the vocabulary.

37. The system as claimed in claim 36, wherein the structured vocabulary organization is 
used to provide text summarization for the associated documents. 

38. The system as claimed in claim 36 used for customer profiling wherein the dictionary
is built and the vocabulary is organized using the documents that are viewed by the
customer.

39. The system as claimed in claim 27 wherein data points correspond to the products
cataloged in the store, the lead value of a product is its per unit profit, its per unit
value or the number of items sold per unit time, the relationship between the products
is either explicitly defined or derived from the purchase data.

40. The system as claimed in claim 39 wherein the product di is related to the product dj
by the fraction of customer transactions containing dj that also contain di.

41. The system as claimed in claim 39 used for analyzing sales of a store for the merchant
or for organizing the layout of the store to facilitate easy access to products.

42. The system as claimed in 39 used to personalize the electronic store layout to an
individual customer by using the relationship that is specific to the customer.

43. The system as claimed in claim 27, used for customer segmentation for a sales or
service organization wherein the data points are the customers in the data base, the
lead values are their total purchase amount per unit time, their income, the number of
times customers visited the store, or the number of items bought by the customer, the
relationship between customers is either explicitly defined or derived from some
relevant data, with the resulting clustering reflecting a structured grouping of
customers with similar performances.

44. The system as claimed in claim 43, wherein the customer di is related to the customer
dj by the fraction of products bought by dj that are also bought by di.

45. A computer program product comprising computer readable program code stored on
computer readable storage medium embodied therein for clustering data points with
defined quantified relationships between them, comprising:

computer readable program code means configured for obtaining lead value
for each data point either by deriving from said quantified relationships or as
given input,

computer readable program code means configured for ranking each data
point in a lead value sequence list in descending order of lead value,

computer readable program code means configured for assigning the first data
point in said lead value sequence list as the leader of the first cluster, and

computer readable program code means configured for considering each
subsequent data point in said lead value sequence list as a leader of a new
cluster if its relationship with the leaders of each of the previous clusters is
less than a defined threshold value or as a member of one or more clusters
where its relationship with the cluster leader is more than or equal to said
threshold value.

46. The computer program product as claimed in claim 45, wherein said relationships
between data points are symmetric or asymmetric.

47. The computer program product as claimed in claim 45, wherein said computer
readable program code means configured for obtaining lead value of each data point
is by taking the sum of relation values of each of the other data points to said data
point.

48. The computer program product as claimed in claim 45, wherein said threshold value 
is adaptively found for a given number of clusters. 

49. A computer program product for organizing a set of data points into a hierarchy of
clusters wherein the computer program product claimed in claim 45 is first used to
cluster the data points into sets of small sizes, each smaller set is further subclustered
using the computer program product and the subclustering is repeated until a
terminating condition is reached.



50. The computer program product as claimed in claim 45 configured for text
summarization of a single document or a collection of documents comprising:

computer readable program code means configured for segmenting the given
input text into blocks such as sentences, a collection of sentences, paragraphs,

computer readable program code means configured for excluding words
belonging to a defined list of 'stop' words,

computer readable program code means configured for replacing words by
their unique synonymous word, if it exists, from a given collection of
synonyms,

computer readable program code means configured for applying stemming
algorithms for mapping words to root words,

computer readable program code means configured for representing the
resulting blocks of text, with respect to a dictionary which is either given or
computed from the input text, by a binary vector of size equal to the number
of words in the dictionary whose ith element is 1 if ith word in the dictionary
is present in the block,

computer readable program code means configured for computing the
relationship between any data points di and dj by evaluating R(di,dj) =
|dj.Tdi|/|dj| wherein T is a thesaurus matrix whose ijth element reflects the
extent of inclusion of meaning of jth word in the meaning of ith word, and

computer readable program code means configured for clustering the data
points wherein the lead value of each data point is determined by taking the
sum of relation values of each of the other data points to said data point, the
threshold value is adaptively found for a given number of clusters and the set
of leaders of the resulting clusters summarize the given text.

51. The computer program product as claimed in claim 50 wherein said dictionary is
computed by taking the fraction of words, excluding the stop words, with highest tfidf
value which is given by:

tfidf(wi) = tfi * log(N/dfi)

where tfidf(wi) is the lead value of data point wi, tfi = the number of times the data
point wi occurred in the whole text, dfi = the number of documents containing the
data point wi and N = the total number of documents in the text.

52. The computer program product as claimed in claim 50 wherein said thesaurus matrix
is either a given identity matrix or computed from a collection of documents.

53. The computer program product as claimed in claim 50 wherein each block is
represented by a vector computer readable program code means, whose ith element
represents the frequency of occurrence of ith word in the block.

54. The computer program product for organizing a set of text documents into a hierarchy
of clusters wherein the computer program product claimed in claim 50 is first used to
cluster the given documents into sets of small sizes, each smaller set is further
subclustered using the computer program product and the subclustering is repeated
until a terminating condition is reached.

55. The computer program product as claimed in claim 54 configured for organizing the
results returned by any information retrieval system in response to a user query into
a hierarchy of clusters.

56. The computer program product as claimed in claim 55, wherein the hierarchy of
clusters is used to aid the user in modifying his/her query and/or in browsing through
the results.

57. The computer program product as claimed in claim 55, wherein the information
retrieval system is any search engine retrieving Web documents.

58. The computer program product as claimed in claim 49, configured for vocabulary
organization for a group of documents wherein the data points are the words in the
dictionary of the vocabulary, the lead value of a word is either its frequency of
occurrence in the collection, the number of documents containing the word or its tfidf
value, the relationship R(di,dj) denotes the fraction of documents containing the jth
word that also contain ith word, and the clustering produced by the computer readable
program code means results in a structured hierarchical organization of the
vocabulary.

59. The computer program product as claimed in claim 58, wherein the structured
vocabulary organization is used to provide text summarization for the associated
documents.

60. The computer program product as claimed in claim 58 configured for customer
profiling wherein the dictionary is built and the vocabulary is organized using the
documents that are viewed by the customer.

61. The computer program product as claimed in claim 49 wherein data points
correspond to the products cataloged in the store, the lead value of a product is its per
unit profit, its per unit value or the number of items sold per unit time, the
relationship between the products is either explicitly defined or derived from the
purchase data.

62. The computer program product as claimed in claim 61 wherein the product di is 
related to the product dj by the fraction of customer transactions containing dj that 
also contain di.

63. The computer program product as claimed in claim 61 configured for analyzing sales
of a store for the merchant or for organizing the layout of the store to facilitate easy
access to products.

64. The computer program product as claimed in 61 configured for personalizing the
electronic store layout to an individual customer by using the relationship that is
specific to the customer.

65. The computer program product as claimed in claim 49, configured for customer
segmentation for a sales or service organization wherein the data points are the
customers in the data base, the lead values are their total purchase amount per unit
time, their income, the number of times customers visited the store, or the number of
items bought by the customer, the relationship between customers is either explicitly
defined or derived from some relevant data, with the resulting clustering reflecting a
structured grouping of customers with similar performances.

66. The computer program product as claimed in claim 65, wherein the customer di is
related to the customer dj by the fraction of products bought by dj that are also bought
by di.